SwePub

Result list for search "db:Swepub ;pers:(Lu Zhonghai);pers:(Song Wenqing)"

  • Results 1-7 of 7
1.
  • Chen, Hui, et al. (author)
  • Symmetric-Mapping LUT-Based Method and Architecture for Computing X^Y-Like Functions
  • 2021
  • In: IEEE Transactions on Circuits and Systems Part 1. - : IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. - 1549-8328 .- 1558-0806. ; 68:3, pp. 1231-1244
  • Journal article (peer-reviewed). Abstract:
    • We propose a new method and hardware architecture for computing functions of the form X^Y (where X and Y are arbitrary floating-point numbers), which supports arbitrary Nth-root, exponential, and power operations. Because direct computation is complex, X^Y is usually converted into logarithm, multiplication, and antilogarithm operations. Traditional approaches suffer from long latency, large area, and high power consumption. To solve this problem, we propose a symmetric-mapping lookup table (SM-LUT) capable of computing log2(x) (x ∈ [1, 2]) and 2^x (x ∈ [0, 1]) simultaneously, which lays the foundation for computing X^Y. To further improve the hardware performance of our architecture, we propose a multi-region address searcher to speed up the SM-LUT calculation. In addition, we use an optimized Vedic multiplier to shorten the critical path and improve the efficiency of the multiplication involved in computing X^Y. Using TSMC 40 nm CMOS technology, we design and synthesize a reference circuit that computes X^Y with a maximum relative error of 10^-3. The report shows that the reference circuit achieves an area of 14338.50 μm² and a power consumption of 4.59 mW at a frequency of 1 GHz. Compared with the state-of-the-art work under the same input range and similar precision, it saves on average 78.57% area and 80.42% power consumption for computing the Nth root of R, and 82.89% area and 81.89% power consumption for computing R^N. In addition, our architecture reduces computation latency by 62.77% on average and achieves energy efficiency one order of magnitude higher than the other designs.
  •  
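The abstract above relies on the identity X^Y = 2^(Y · log2 X). The following is a minimal Python sketch of that identity only, for X > 0. The table resolution, the function names, and the plain truncating lookup are illustrative assumptions; this is not the SM-LUT or the multi-region address searcher described in the paper.

```python
import math

TABLE_BITS = 10           # assumed table resolution (hypothetical, not from the paper)
SIZE = 1 << TABLE_BITS

# Two tables sampled uniformly: log2(m) for m in [1, 2) and 2**f for f in [0, 1).
LOG2_LUT = [math.log2(1 + i / SIZE) for i in range(SIZE)]
EXP2_LUT = [2.0 ** (i / SIZE) for i in range(SIZE)]


def log2_lut(x: float) -> float:
    """Approximate log2(x) for x > 0: extract the exponent, look up the mantissa."""
    m, e = math.frexp(x)      # x = m * 2**e with m in [0.5, 1)
    m, e = m * 2.0, e - 1     # renormalise so that m lies in [1, 2)
    idx = min(int((m - 1.0) * SIZE), SIZE - 1)
    return e + LOG2_LUT[idx]


def exp2_lut(t: float) -> float:
    """Approximate 2**t by splitting t into an integer part and a fraction in [0, 1)."""
    n = math.floor(t)
    f = t - n
    idx = min(int(f * SIZE), SIZE - 1)
    return math.ldexp(EXP2_LUT[idx], n)   # EXP2_LUT[idx] * 2**n


def pow_lut(x: float, y: float) -> float:
    """Approximate x**y for x > 0 as 2**(y * log2(x))."""
    return exp2_lut(y * log2_lut(x))
```

With negative or fractional y the same routine covers reciprocals and Nth roots, for example `pow_lut(r, 1.0 / n)`. Accuracy in this sketch is limited by the assumed table size, since the lookup truncates rather than interpolates.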
2.
  • Chen, Qinyu, et al. (author)
  • An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective
  • 2020
  • In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 1063-8210 .- 1557-9999. ; 28:6, pp. 1540-1544
  • Journal article (peer-reviewed). Abstract:
    • Convolutional neural networks (CNNs) have emerged as one of the most popular approaches in many fields. These networks deliver better performance as they become deeper and larger. However, their complicated computation and huge storage requirements impede hardware implementation. To address this problem, quantized networks have been proposed. In addition, various convolutional structures have been designed to meet the requirements of different applications. For example, whereas image classification uses traditional convolutions (CONVs), image generation usually combines traditional CONVs, dilated CONVs, and transposed CONVs, leading to a difficult hardware mapping problem. In this brief, we translate the difficult mapping problem into a sparsity problem and propose an efficient hardware architecture for sparse binary and ternary CNNs by exploiting their sparsity and low bit-width characteristics. To this end, we propose an ineffectual data removing (IDR) mechanism that removes both regular and irregular sparsity, based on dual-channel processing elements (PEs). In addition, a flexible layered load balance (LLB) mechanism is introduced to alleviate load imbalance. The accelerator is implemented in 65-nm technology with a core size of 2.56 mm². It achieves an energy efficiency of 3.72 TOPS/W at 50.1 mW, which makes it a promising design for embedded devices.
  •  
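One general way to see how a mapping problem can become a sparsity problem: a dilated convolution gives the same result as an ordinary convolution whose kernel has zeros inserted between its taps, so hardware that skips zero operands can run both. The Python sketch below illustrates this 1-D equivalence only; the function names are hypothetical, and the zero-skipping loop is a software stand-in, not the IDR mechanism or the PE design from the paper.

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D correlation that skips zero weights (ineffectual work)."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0
        for j, w in enumerate(kernel):
            if w == 0:
                continue              # a zero weight contributes nothing
            acc += w * signal[i + j]
        out.append(acc)
    return out


def dilate_kernel(kernel, dilation):
    """Insert (dilation - 1) zeros between neighbouring kernel taps."""
    out = []
    for j, w in enumerate(kernel):
        if j:
            out.extend([0] * (dilation - 1))
        out.append(w)
    return out


def dilated_conv1d(signal, kernel, dilation):
    """Reference implementation: direct dilated correlation."""
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
ternary_kernel = [1, -1, 1]           # weights restricted to {-1, 0, 1}
sparse_kernel = dilate_kernel(ternary_kernel, 2)   # [1, 0, -1, 0, 1]
result = conv1d(signal, sparse_kernel)             # [3, 4, 5]
```

A transposed convolution can be treated in a similar spirit by inserting zeros into the input instead of the kernel, which is why both cases produce regular, predictable sparsity.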
3.
  • Chen, Qinyu, et al. (author)
  • An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks
  • 2019
  • In: Electronics. - : MDPI. - 2079-9292. ; 8:4
  • Journal article (peer-reviewed). Abstract:
    • Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition, speech processing, and many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs are proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse grain task partitioning (CGTP) strategy, the proposed accelerator with heterogeneous computing units, supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which can reduce the power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with a low-power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to support the streaming architecture. The accelerator is implemented with TSMC 40 nm technology with a core size of . It can achieve TOPS/W energy efficiency and area efficiency at 100.1 mW, which makes it a promising design for embedded devices.
  •  
4.
  • Chen, Qinyu, et al. (author)
  • Smilodon : An Efficient Accelerator for Low Bit-Width CNNs with Task Partitioning
  • 2019
  • In: 2019 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS). - : IEEE. - 9781728103976
  • Conference paper (peer-reviewed). Abstract:
    • Convolutional Neural Networks (CNNs) have been widely applied in various fields such as image and video recognition, recommender systems, and natural language processing. However, their massive size and intensive computation loads prevent feasible deployment in practice, especially on embedded systems. As a highly competitive candidate, low bit-width CNNs are proposed to enable efficient implementation. In this paper, we propose Smilodon, a scalable, efficient accelerator for low bit-width CNNs based on a parallel streaming architecture, optimized with a task partitioning strategy. We also present 3D systolic-like computing arrays suited to convolutional layers. Our design is implemented on a Zynq XC7Z020 FPGA and meets real-time requirements with a throughput of 1,622 FPS while consuming 2.1 W. To the best of our knowledge, our accelerator is superior to state-of-the-art works in the tradeoff among throughput, power efficiency, and area efficiency.
  •  
5.
  • Gao, Qian, et al. (author)
  • Dynamic and Traffic-Aware Medium Access Control Mechanisms for Wireless NoC Architectures
  • 2021
  • In: 2021 IEEE International Symposium on Circuits and Systems (ISCAS). - : IEEE.
  • Conference paper (peer-reviewed). Abstract:
    • Wireless NoC (WiNoC) has low latency and simple wiring, which can reduce the energy consumption caused by the metal interconnection in traditional NoC architectures. However, the traditional time division based medium access control (MAC) mechanism in WiNoC is not aware of the traffic demands of the different wireless interfaces (WIs), resulting in an unreasonable distribution of wireless communication channels and degraded performance. Hence, in order to dynamically allocate wireless channels to the WIs based on their traffic demands, a dynamic and traffic-aware MAC mechanism is required. In this paper, we design a traffic demand predictor for each WI based on its current and historical traffic conditions. According to the predicted demands, we are able to allocate access to wireless channels dynamically and switch between two kinds of time division based MAC mechanisms. Simulations under various conditions indicate that the average delay decreases by 30% and 20% compared with a traditional MAC mechanism and an existing dynamic time division based one, respectively. Moreover, the network with the dynamic and traffic-aware MAC reaches the saturation point at a higher packet injection rate.
  •  
6.
  • Shen, Sirui, et al. (author)
  • A Hierarchical Parallel Discrete Gaussian Sampler for Lattice-Based Cryptography
  • 2022
  • In: 2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22). - : Institute of Electrical and Electronics Engineers (IEEE). ; pp. 1729-1733
  • Conference paper (peer-reviewed). Abstract:
    • Discrete Gaussian sampling is one of the important components in lattice-based cryptosystems, which are promising candidates for post-quantum cryptographic algorithms. For sufficient security and satisfactory performance, the Knuth-Yao algorithm is an efficient way to implement discrete Gaussian samplers. Nevertheless, most polynomials in lattice-based cryptography have 256 coefficients or more, so sample generation suffers from long latency. In this paper, the first parallel discrete Gaussian sampler with a hierarchical structure is proposed, while preserving the statistical distance to the actual distribution. Based on the imbalanced visiting frequency of the probability matrix, a three-stage generation strategy is adopted with hierarchical bit search units (BSUs), which greatly reduces the area consumed by the repeated, costly lookup tables. Besides the architecture improvement, a lowest-set-bit scanning scheme is introduced to the BSUs. Moreover, the parallelism of our design provides obfuscation against side-channel attacks (SCAs). A practical hardware implementation of the discrete Gaussian distribution with sigma = 3.33 on the Xilinx Virtex-5 XC5VLX30 FPGA device spends 26.12 ns on average to generate 256 samples, consuming 994 slices. Results have verified its area-efficiency advantage over state-of-the-art (SOA) designs.
  •  
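For readers unfamiliar with the Knuth-Yao algorithm named in the abstract above, the following is a minimal sequential Python sketch of its column-by-column random walk over a probability matrix. It is a textbook-style software version under stated assumptions: the helper names, the tail cut, the precision, and the restart on exhausted precision are illustrative choices, and it samples only non-negative values. It does not reproduce the paper's hierarchical BSUs, three-stage strategy, or parallelism.

```python
import math
import random


def to_bits(p, precision):
    """First `precision` binary digits of p, for 0 <= p < 1."""
    bits = []
    for _ in range(precision):
        p *= 2
        bit = int(p)
        bits.append(bit)
        p -= bit
    return bits


def gaussian_matrix(sigma, tail, precision):
    """Rows are the values 0..tail (tail >= 1); columns are binary digits of each probability."""
    weights = [math.exp(-x * x / (2 * sigma * sigma)) for x in range(tail + 1)]
    total = sum(weights)
    return [to_bits(w / total, precision) for w in weights]


def knuth_yao(matrix, next_bit):
    """Consume one random bit per column; return a row index, or None if precision runs out."""
    d = 0
    for col in range(len(matrix[0])):
        d = 2 * d + next_bit()
        for row in range(len(matrix) - 1, -1, -1):
            d -= matrix[row][col]
            if d < 0:
                return row
    return None


def sample(matrix, rng=random):
    """Draw one sample, restarting in the rare case that the truncated matrix is exhausted."""
    while True:
        x = knuth_yao(matrix, lambda: rng.getrandbits(1))
        if x is not None:
            return x
```

For example, `sample(gaussian_matrix(3.33, 30, 32))` draws one value in the range 0..30. Most walks end within the first few columns, which is the imbalance in visiting frequency that the abstract refers to.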
7.
  • Song, Wenqing, et al. (author)
  • Heterogeneous Reconfigurable Accelerator for Homomorphic Evaluation on Encrypted Data
  • 2024
  • In: IEEE Access. - : Institute of Electrical and Electronics Engineers (IEEE). - 2169-3536. ; 12, pp. 11850-11864
  • Journal article (peer-reviewed). Abstract:
    • Homomorphic encryption (HE) enables third-party servers to perform computations on encrypted user data while preserving privacy. Although conceptually attractive, the speed of software implementations of HE is almost impractical. To address this challenge, various domain-specific architectures have been proposed to accelerate homomorphic evaluation, but efficiency remains a bottleneck. In this paper, we propose a homomorphic evaluation accelerator with heterogeneous reconfigurable modular computing units (RCUs) for the Brakerski/Fan-Vercauteren (BFV) scheme. RCUs leverage operator abstraction to efficiently perform basic sub-operations of homomorphic evaluation such as residue number system (RNS) conversion, number theoretic transform (NTT), and other modular computations. By combining these sub-operations, complex homomorphic evaluation operations like multiplication, rotation, and addition are efficiently executed. To address the high demand for data access and improve memory efficiency, we design a coordinate-based address encoding strategy that enables in-place and conflict-free data access. Furthermore, specific optimizations are performed on the core sub-operations such as NTT and automorphism. The proposed architecture is implemented on Xilinx Virtex-7 and UltraScale+ FPGA platforms and evaluated for polynomials of length 4096. Compared to state-of-the-art accelerators with the same parameter set, our accelerator achieves the following advantages: 1) 2.04x to 3.33x reduction in the area-time product (ATP) for the key sub-operation NTT, 2) 1.08x to 7.42x reduction in latency for homomorphic multiplication with higher area efficiency, and 3) support for a wider range of homomorphic evaluation operations, including rotation, compared to other BFV-based accelerators.
  •  
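The number theoretic transform (NTT) mentioned in the abstract above is a discrete Fourier transform over integers modulo a prime, which turns polynomial multiplication into pointwise multiplication. The Python sketch below shows the idea with a naive O(n²) transform and toy parameters; the modulus 17, length 8, and root 2 are assumptions chosen for readability. It computes a cyclic product (modulo x^n - 1), whereas BFV implementations typically work modulo x^n + 1 with much larger parameters, and it says nothing about the paper's RCUs or address encoding.

```python
Q = 17      # toy prime modulus (assumption; real parameters are far larger)
N = 8       # transform length; N divides Q - 1
OMEGA = 2   # primitive N-th root of unity modulo Q: 2**8 % 17 == 1 and 2**4 % 17 == 16


def ntt(a, omega=OMEGA, q=Q):
    """Naive forward transform: A[k] = sum_j a[j] * omega**(j*k) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q for k in range(n)]


def intt(A, omega=OMEGA, q=Q):
    """Inverse transform: use the inverse root, then scale by the inverse of n."""
    n = len(A)
    inv_n = pow(n, -1, q)          # modular inverse (Python 3.8+)
    inv_omega = pow(omega, -1, q)
    return [inv_n * x % q for x in ntt(A, inv_omega, q)]


def cyclic_mul(a, b, q=Q):
    """Multiply two polynomials modulo x**N - 1 and q via pointwise products."""
    return intt([x * y % q for x, y in zip(ntt(a), ntt(b))])


a = [1, 2, 3, 4, 0, 0, 0, 0]       # 1 + 2x + 3x^2 + 4x^3
b = [5, 6, 0, 0, 0, 0, 0, 0]       # 5 + 6x
product = cyclic_mul(a, b)         # [5, 16, 10, 4, 7, 0, 0, 0]
```

Practical designs replace the quadratic loop with a butterfly network of O(n log n) modular multiplications, which is the part hardware accelerators usually optimise.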
Type of publication
journal article (4)
conference paper (3)
Type of content
peer-reviewed (7)
Author/editor
Li, Li (7)
Fu, Yuxiang (7)
Chen, Qinyu (3)
Cheng, Kaifeng (2)
Zhang, Chuan (2)
Wang, Xinyu (2)
Shen, Sirui (2)
Huang, Yan (1)
Chen, Hui (1)
Wang, Yilin (1)
Sun, Rui (1)
Gao, Qian (1)
Yu, Zongguang (1)
Yang, Heping (1)
Shao, Xinyu (1)
Xu, Congwei (1)
University
Kungliga Tekniska Högskolan (7)
Language
English (7)
Research subject (UKÄ/SCB)
Natural sciences (4)
Engineering and technology (3)